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Abstract: The paper proposes a method for constructing a sparse estima- 
tor for the inverse covariance (concentration) matrix in high-dimensional 
settings. The estimator uses a penalized normal likelihood approach and 
forces sparsity by using a lasso-type penalty. We establish a rate of con- 
vergence in the Frobenius norm as both data dimension p and sample size 
n are allowed to grow, and show that the rate depends explicitly on how 
sparse the true concentration matrix is. We also show that a correlation- 
based version of the method exhibits better rates in the operator norm. We 
also derive a fast iterative algorithm for computing the estimator, which 
relies on the popular Cholesky decomposition of the inverse but produces 
a permutation-invariant estimator. The method is compared to other es- 
timators on simulated data and on a real data example of tumor tissue 
classification using gene expression data. 
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1. Introduction 

Estimation of large covariance matrices, particularly in situations where the data 
dimension p is comparable to or larger than the sample size n, has attracted a lot 
of attention recently. The abundance of high-dimensional data is one reason for 
the interest in the problem: gene arrays, fMRI, various kinds of spectroscopy, cli- 
mate studies, and many other applications often generate very high dimensions 
and moderate sample sizes. Another reason is the ubiquity of the covariance 
matrix in data analysis tools. Principal component analysis (PCA), linear and 
quadratic discriminant analysis (LDA and QDA), inference about the means of 
the components, and analysis of independence and conditional independence in 
graphical models all require an estimate of the covariance matrix or its inverse, 
also known as the precision or concentration matrix. Finally, recent advances 
in random matrix theory - see Johnstone (2001) for a review, and also Paul 
(2007) - allowed in-depth theoretical studies of the traditional estimator, the 
sample (empirical) covariance matrix, and showed that without regularization 
the sample covariance performs poorly in high dimensions. These results helped 
stimulate research on alternative estimators in high dimensions. 

Many alternatives to the sample covariance matrix have been proposed. A 
large class of methods covers the situation where variables have a natural or- 
dering, e.g., longitudinal data, time series, spatial data, or spectroscopy. The 
implicit regularizing assumption underlying these methods is that variables far 
apart in the ordering have small correlations (or partial correlations, if the object 
of regularization is the concentration matrix). Methods for regularizing covari- 
ance by banding or tapering have been proposed by Bickcl and Lcvina (2004) 
and Furrer and Bengtsson (2007). Bickel and Levina (2008) showed consistency 
of banded estimators in the operator norm under mild conditions as long as 
(logp)/n 0, for both banding the covariance matrix and the Cholesky factor 
of the inverse discussed below. 

When the inverse of the covariance matrix is the primary goal and the vari- 
ables are ordered, regularization is usually introduced via the modified Cholesky 
decomposition, 

= L^D-^L. 

Here L is a lower triangular matrix with Ijj ~ 1 and Ijji = —4>jj', where 
(pjj' ; j' < j is the coefficient of Xji in the population regression of Xj on 
Xi, . . . , Xj_i, and D is a diagonal matrix with residual variances of these re- 
gressions on the diagonal. Several approaches to regularizing the Cholesky factor 
L have been proposed, mostly based on its regression interpretation. A fc-banded 
estimator of L can be obtained by regressing each variable only on its closest k 
predecessors; Wu and Pourahmadi (2003) proposed this estimator and chose k 
via an AIC penalty. Bickel and Levina (2008) showed that banding the Cholesky 
factor produces a consistent estimator in the operator norm under weak con- 
ditions on the covariance matrix, and proposed a cross-validation scheme for 
picking k. Huang ct al. (2006) proposed adding either an I2 (ridge) or an li 
(lasso) penalty on the elements of L to the normal likelihood. The lasso penalty 
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creates zeros in L in arbitrary locations, which is more flexible than banding, 
but (unlike in the case of banding) the resulting estimate of the inverse may not 
have any zeros at all. Le\'iiia ct al. (2008) proposed adaptive banding, which, by 
using a nested lasso penalty, allows a different k for each regression, and hence 
is more flexible than banding while also retaining some sparsity in the inverse. 
Bayesian approaches to the problem introduce zeros via priors, either in the 
Cholesky factor (Smith and Kohn, 2002) or in the inverse itself (Wong ct al., 

2003) . 

There are, however, many applications where an ordering of the variables 
is not available: genetics, for example, or social and economic studies. Meth- 
ods that are invariant to variable permutations (like the covariance matrix it- 
self) are necessary in such applications. Regularizing large covariance matrices 
by Steinian shrinkage of eigenvalues has been proposed early on (HafF, 1980; 
Dey and Srinivasan, 1985). More recently, Ledoit and Wolf (2003) proposed a 
way to compute an optimal linear combination of the sample covariance with 
the identity matrix, which also results in shrinkage of eigenvalues. Shrinkage es- 
timators are invariant to variable permutations but they do not affect the eigen- 
vectors of the covariance, only the eigenvalues, and it has been shown that the 
sample eigenvectors are also not consistent when p is large (Johnstone and Lu, 

2004) . Shrinking eigenvalues also does not create sparsity in any sense. Some- 
times alternative estimators are available in the context of a specific application 
- e.g., for a factor analysis model with known factors Fan ct al. (2008) develop 
regularized estimators for both the covariance and its inverse. 

Our focus here will be on sparse estimators of the concentration matrix. 
Sparse concentration matrices are widely studied in the graphical models litera- 
ture, since zero partial correlations imply a graph structure. The classical graph- 
ical models approach, however, is different from covariance estimation, since it 
normally focuses on just finding the zeros. For example, Drton and Pcrlnian 
(2008) develop a multiple testing procedure for simultaneously testing hypothe- 
ses of zeros in the concentration matrix. There are also more algorithmic ap- 
proaches to finding zeros in the concentration matrix, such as running a lasso re- 
gression of each variable on all the other variables (Meinshausen and Biihlmann, 
2000), or the PC-algorithm (Kalisch and Biihlmann, 2007). Both have been 
shown to be consistent in high-dimensional settings, but none of these methods 
supply an estimator of the covariance matrix. In principle, once the zeros are 
found, a constrained maximum likelihood estimator of the covariance can be 
computed (Chaudhuri et al., 2007), but it is not clear what the properties of 
such a two-step procedure would be. 

Two recent papers, d'Asprcniont ct al. (2008) and Yuan and Lin (2007), take 
a penalized likelihood approach by applying an li penalty to the entries of the 
concentration matrix. This results in a permutation-invariant loss function that 
tends to produce a sparse estimate of the inverse. Yuan and Lin (2007) used 
the max-det algorithm to compute the estimator, which limited their numerical 
results to values of p < 10, and derived a fixed p, large n convergence result. 
d'Asprcmont ct al. (2008) proposed a much faster semi-definite programming 
algorithm based on Nesterov's method for interior point optimization. While 
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this paper was in review, a new very fast algorithm for the same problem was 
proposed by Friedman et al. (2008), which is based on the coordinate descent 
algorithm for the lasso (Friedman et al., 2007). 

In this paper, we analyze the estimator resulting from penalizing the normal 
likelihood with the penalty on the entries of the concentration matrix (we will 
refer to this estimator as SPICE - Sparse Permutation Invariant Covariance Es- 
timator) in the high-dimensional setting, allowing both the dimension p and the 
sample size n to grow. We give an explicit convergence rate in the Frobenius 
norm and show that the rate depends on how sparse the true concentration 
matrix is. For a slight modification of the method based on using the sample 
correlation matrix, we obtain the rate of convergence in operator norm and show 
that it is essentially equivalent to the rate of thresholding the covariance matrix 
itself obtained in Bickel and Lcvina (2007). We also derive our own optimiza- 
tion algorithm for computing the estimator, based on Cholesky decomposition 
and the local quadratic approximation. Unlike other estimation methods that 
rely on the Cholesky decomposition, our algorithm is invariant under variable 
permutations. Because we use the local quadratic approximation, the algorithm 
is equally applicable to general Iq penalties on the entries of the inverse, not 
just li. 

The rest of the paper is organized as follows: Section 2 summarizes the SPICE 
approach in general, and presents consistency results. The Cholesky-based com- 
putational algorithm, along with a discussion of optimization issues, is presented 
in Section 3. Section 4 presents numerical results for SPICE and a number of 
other methods, for simulated data and a real example on classification of colon 
tumors using gene expression data. Section 5 concludes with discussion. 

2. Analysis of the SPICE method 

We assume throughout that we observe Xi, . . . , Xn, i.i.d. p-variate normal 
random variables with mean and covariance matrix Eq, and write Xi = 
{Xii, . . . , Xip)'^ . Let Eo = [croij], and flo = Eq ^ be the inverse of the true co- 
variance matrix. For any matrix M = [mij], we write \M\ for the determinant 
of M, tr(M) for the trace of M, and ipmax{M) and (^inin(Af) for the largest 
and smallest eigenvalues, respectively. We write Af + — diag(M) for a diagonal 
matrix with the same diagonal as M, and = M — In the asymptotic 
analysis, we will use the Frobenius matrix norm ||Af|||, = ^^'^ 
operator norm (also known as matrix 2-norm), ||M|p = ip^naxiMM'^). We will 
also write | • |i for the li norm of a vector or matrix vectorized, i.e., for a matrix 
|M|i=E..,|m,,|. 

It is easy to see that under the normal assumption the negative log-likelihood, 
up to a constant, can be written in terms of the concentration matrix as 



f(Xi,...,X„;r!) =tr(r!E)-log|l)|, 
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where 

1 " 

S = -V(X,-X)(X,-X)^ 
n ^ — ' 

i=i 

is the sample covariance matrix. 

We define the SPICE estimator JIa of the inverse covariance matrix as the 
minimizcr of the penalized negative log-likelihood, 

= argmin{tr(17E) -log|17| + A|r2-|i} (1) 

where A is a non-negative tuning parameter, and the minimization is taken over 
symmetric positive definite matrices. 

SPICE is identical to the lasso-type estimator proposed by Yuan and Lin 
(2007), and very similar to the estimator of d'Asprcmont ct al. (2008) (they 
used |ri|i rather than in the penalty). The loss function is invariant to 

permutations of variables and should encourage sparsity in $7 due to the l\ 
penalty applied to its off-diagonal elements. 

We make the following assumptions about the true model: 



Al 
A2 
A3 



Let the set S = {(z, j) : fioy 7^ 0, i 7^ 3). Then card(5) < s. 
<Pmin(So) > 4 > Oi or equivalently i^max(f^o) < 

<y5max(So) < k. 



Note that assumption A2 guarantees that fio exists. Assumption Al is more 
of a definition, since it does not stipulate anything about s (s = p(p — l)/2 
would give a full matrix). 

Theorem 1. Let Sl\ he the minimizer defined by (1). Under Al, A2, A3, if 



\n,-n,\\, = oJynP±^}^]. (2) 



The theorem can be restated, more suggestively, as 

\\CIx-^o\\f f f, , s\logp 



The reason for the second formulation (3) is the relation of the Frobenius norm 
to the operator norm, ||M|||/p < ||M||2 < \\M\\j,. 

Before proceeding with the proof of Theorem 1, we discuss a modification 
to SPICE based on using the correlation matrix. An inspection of the proof 
reveals that the worst part of the rate, i/plogp/n, comes from estimating the 
diagonal. This suggests that if we were to use the correlation matrix rather than 
the covariance matrix, we should be able to get the rate of \/ s \ogp/n. Indeed, 
let Eq = VFFVF, where F is the true correlation matrix, and W is the diagonal 
matrix of true standard deviations. Let W and F be the sample estimates of W 
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and r, i.e., = E+, f = W'^tW'^. Let K = T-\ Define a SPICE estimate 
of by 

lix = argniin{tr(r2f) -log|Jl| + A|f7-|i} (4) 

Then we can define a modified correlation-based estimator of the concentration 
matrix by 

nx=::W-^KxW-\ (5) 

It turns out that in the Frobcnius norm has the same rate as f2, but for Q we 
can get a convergence rate in the operator norm (matrix 2-norm). As discussed 
previously by Bickel and Levina (2008), El Karoui (2007) and others, the oper- 
ator norm is more appropriate than the Frobcnius norm for spectral analysis, 
e.g., PCA. It also allows for a direct comparison with banding rates obtained in 
Bickel and Levina (2008) and thresholding rates in Bickel and Levina (2007). 

Theorem 2. Under assumptions of Theorem 1, 



(s + l)logp 



Note. This rate is very similar to the rate for thresholding the covariance 
matrix obtained by Bickel and Levina (2007). They showed that under the as- 
sumption maxi lo-yl' < co(p) for < g < 1, if the sample covariance entries 

are set to when their absolute values fall below the threshold X ~ M\' 



then the resulting estimator converges to the truth in operator norm at the rate 

no worse than Op ^Co(p) (^^^) Since the truly sparse case corresponds 

to g = 0, and Co(j>) is a bound on the number of non-zero elements in each 
row, and thus ^/s x co(p), this rate coincides with ours, even though the es- 
timator and the method of proof are very different. However, Lemma 1 below 
is the basis of the proof in both cases, and ultimately it is the bound (6) that 
gives rise to the same rate. A similar rate has been obtained for banding the 
covariance matrix in Bickel and Levina (2008), under an additional assumption 
that depends on the ordering of the variables and is not applicable here (see 
Bickel and Levina (2007) for a comparison between banding and thresholding 
rates). 

In the proof, we will need a lemma of Bickel and Levina (2008) (Lemma 3) 
which is based on a large deviation result of Sauhs and Statulevicius (1991). We 
state the result here for completeness. 

Lemma 1. Let Zi be i.i.d. A/'(0, Sp) and ipmnxC^p) < < oo. Then, ifSp = 
[o-ab], 



where ci, ci and S depend on k only. 



< ci exp(— C2n;^^) for < 6 (6) 
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Proof of Theorem 1. Let 

Q{n) =tr(f7E) - log|fi| + A|fi"|i - tr(r2oS) + log |17o| - X\n^\i 
=tr[(r! - no){t - So)] - (log \n\ - log ll^ol) 

+ tr[(17-17o)So] +A(|r!-|i-|r!o|i) (7) 

Our estimate Q minimizes Q{il), or equivalently A = 17 — l^o minimizes G{A) = 
Q{flo + A). Note that we suppress the dependence on A in f2 and A. 
The main idea of the proof is as follows. Consider the set 

e„(M) = {A : A = A^, \\A\\f = Mr„}, 

where 

V n 

Note that G{A) = Q{^lo + A) is a convex function, and 

G(A) < G(0) = 0. 

Then, if we can show that 

inf{G(A) : A e e„(M)} > 0, 

the minimizer A must be inside the sphere defined by 0„(A/), and hence 

||A||f <Mr„. (8) 

For the logarithm term in (7), doing the Taylor expansion of f{t) — log jil + tA\ 
and using the integral form of the remainder and the symmetry of A, Sq; and 
fto gives 



log|l]o+A|-log|17o| =tr(SoA)-A^ 



1 

{l-v){no+vAy^(g){no+vAy^dv 



A 



(9) 

where (g) is the Kronecker product (if A = [aylpixqi, B ~ [bki]p2xq2i then 
AiSi B = [a.ijbki]pj^p2xqiq2)i a-iid A is A vectorized to match the dimensions of 
the Kronecker product. 

Therefore, we may write (7) as, 



G(A) =tr(A(S-So)) +A^ 



1 

(1 - v){no + vAy^ (g) {no + vA)-^dv 



A 



+ xi\n^ + A-\,-\n^\,) (10) 

For an index set A and a matrix M = [rriij], write Ma = [to^ J((z, j) S A)], 
where !{■) is an indicator function. Recall S ~ {{hj) ■ ^oij 7^ 0, i 7^ j} and 
let S be its complement. Note that \Qq + A^|i = + A^|i + |A^|i, and 
|ri(^|i = |ri^g|i. Then the triangular inequality implies 

Ad^^o + A-|i - li) > A(|A-|i - lA^li). (11) 
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Now, using symmetry again, we write 



|tr(A(E-Eo))| < 



^((Tij - (Toy) Ay 



I + 11. (12) 



To bound term I, note that the union sum inequahty and Lemma 1 imply 
that, with probabihty tending to 1, 



max|(Ty - (Toy I < Ci 



logp 



and hence term I is bounded by 



(13) 



The second bound comes from the Cauchy-Schwartz inequahty and Lemma 1: 



II < 



.1=1 



1/2 



l<i<p 



plogp 



A+\\f<C2 



{p + s)logp 



(14) 



also with probability tending to 1. 
Now, take 



^ ^ Ci /logp 



By (10), 



(15) 



G(A)>ifc2||A|||,-Ci 



+ A(|A-|i-|A5|i) 



Tfc'l|A|l|-Ci 



logp 



1-- |Az|i-Ci 



logp 



l + -)|A^|i 



^ , (p + .s)logp 



(16) 



The first term comes from a bound on the integral which we will argue sepa- 
rately below. The second term is always positive, and hence we may omit it for 
the lower bound. Now, note that 



lAgli < V^\\Ag\\F < V^\\A-\\f < VpT7||A-||f. 
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Thus we have 
G(A) > 



lA-lll 



4 V n 



-lllA-|l^i 



A 



|A-||| 



+ 1|2 



{p + s)\ogp 



A^ 





+ IIA+III 


"1^2 _ C, 


4- eAf 




4- M 



> (17) 



for M sufBciently large. 

It only remains to check the bound on the integral term in (10). Recall that 
'Pniin(M) = min||3.||^]^ After factoring out the norm of A, we have, for 

A € e„(M), 

/•I 1 

> / (1 - v)vLni^o + vA)-'dv > - min </.L„(f^o + vA)-' 
Jq ^ o<t)<i 

> imin{^L.(^^o + A)-i : ||A||;^ < Mr„}. 

The first inequality uses the fact that the eigenvalues of Kronecker products of 
symmetric matrices are the products of the eigenvalues of their factors. Now 



1 



<i„(f}o + A)-^ - ^,-lJ17o + A) > (||f7o!| + !|A|r^ > -k 



(18) 



with probability tending to 1, since ||A|| < ||Aj|i? = o(l). This establishes the 
theorem. □ 



As noted above, an inspection of the proof shows that y^p\ogp/n in the 
rate comes from estimating the diagonal. If we focus on the correlation matrix 
estimate Kx in (4) instead, we can immediately obtain 

Corollary 1. Under assumptions of Theorem 1, 

\\kx-K\\F = Op 



s logp 



Now we can use Corollary 1 to prove Theorem 2, the operator norm bound. 
Proof of Theorem 2. Write 

\\nx - QoW = \\w-'^KxW-'^ - w-'^Kw-'^w 

< \\W-^ - W-'^W \\Kx - K\\ \\W-'^ - I^-^ll 



\Kx\\ \\W-^\\ + \\W-^\\ \\K\ 



\K^ 



All \\w-H \\w-^\ 
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where we are using the sub-muhiphcativc norm property ||^-B|| < ||_B|| 
(see, e.g., Golub and Van Loan (1989)). Now, \\W-^\\ and are 0(1) by 

assumptions A2 and A3. Lemma 1 impUes that 

\\w'-w^^opy^y (19) 

and since \\W^'^ - x ||W^^ - W'^\\ (where by A x i? we mean A = Op{B) 

and B ~ Op{A)), we have the rate of i/logp/n for ||14^~^ — . This together 
with CoroUary 1 in turn imphes that ||Vi^~^|| and |1A'a|| are Op{l), and the 
theorem foUows. □ 

Note that in the Frobenius norm, we only have || VF^— = Op{y^p\ogp/n), 
and thus the Frobenius rate of is the same as that of 



3. The Cholesky-based SPICE algorithm 



In this section, we develop an iterative algorithm for computing the SPICE esti- 
mator using the Cholesky decomposition; however, unlike other estimators that 
depend on the Cholesky decomposition, we minimize a permutation invariant 
objective function, and thus the estimator remains permutation invariant. We 
use the quadratic approximation to the absolute value, a standard tool in op- 
timization which has been previously used in the statistics literature to handle 
lasso-type penalties, for example, by Fan and Li (2001) and Huang et al. (2006). 
In this our algorithm differs from the glasso algorithm of Friedman et al. (2008), 
which is based on a lasso algorithm and works directly on the absolute values. 
Both algorithms have computation complexity of 0{p^), but we acquire another 
small constant factor (on the order of 10) due to the additional iterations re- 
quired for the quadratic approximation to converge (see more on this in Section 
4). However, using the quadratic approximation allows us to write down the 
algorithm explicitly in general terms for an Iq penalty with g > 1, rather 

than only for q = I. In particular, our algorithm is equally applicable for use 
with a ridge penalty {q = 2), although in that special case it simplifies even 
further, or with a bridge penalty {1 < q < 2) proposed by Fu (1998), which may 
work better for certain classes of covariances. It can also be used with SCAD 
(Fan and Li, 2001) or other more complicated non-convex penalties that are 
typically approximated by the local quadratic approximation. Even though we 
derive the algorithm with a general q, in this paper we only present results for 
q = l. 

Our goal is to minimize the objective function, 

f{n) = tr(f}S) - loglf^l + A ^ loj,,,]", (20) 

where q = I corresponds to the computation of (}\ in (1). For q > I, the ob- 
jective function is convex in the elements of il and has a global minimum tl. 
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Our strategy is to re-parametrize the objective (20) using the Cholesky decom- 
position of r2 to enforce automatic positive definiteness. Rather than using the 
modified Cholesky decomposition with its regression interpretation, as has been 
standard in the hterature, we simply write 



n 



where T = [ty] is a lower triangular matrix. We can still use the regression 
interpretation if needed, by writing 



t 



33 



^33 ~ 



<l^33' 



^33 



J <J 



(21) 



where (pjj' is the coefficient of Xjr in the regression of Xj on Xi, . . . , Xj^i, and 
djj is the corresponding residual variance. 

To minimize / in terms of T, we apply a cyclical coordinate descent approach 
and minimize / with respect to one element of T at a time. Further, we use a 
quadratic approximation to /, which allows us to find the minimum of the 
univariate functions of each parameter in closed form. The algorithm is iterated 
until convergence. Here we outline the main steps of the algorithm, and leave 
the full derivation for the Appendix. 

In a slight abuse of notation, we write X for the n x p data matrix where 
each column has already been centered by its sample mean. The three terms in 
(20) can be expressed as a function of T as follows: 



tr 



^ n p / j \ 2 

("^) - -EE E^^'^^^^H (22) 
i=i j=i \ k=i / 



\og\n\ = 2j2^ogt,, 



El 



2E 

3'>3 



k=3' 



(23) 
(24) 



The quadratic approximation for is shown in (25). Since the algorithm 
is iterative, m^'^-' denotes the value of u from the previous iteration, and u'-'^"'"^^ 
is the value at current iteration. 



(25) 



Hvmtcr and Li (2005) suggest replacing | in the denominator with jit'^'^^ |-|- 
e to avoid division by zero, and refer to this as the e-perturbed quadratic approx- 
imation. This quadratic approximation to /, which we denote f^^k at iteration 
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k, allows us to easily take partial derivatives with respect to each parameter 
in T, and provides a closed form solution for the univariate minimizer for each 
coordinate. 

The algorithm requires an initial value T^^\ which corresponds to If the 
sample covariance E is non-degenerate, which is generally the case for p < n, 
one could simply set fl^^") = E~^. More generally, we found the following simple 
strategy to work well: approximate in (21) by regressing Xj on Xji alone, 
for j' = 1, ... ,j — 1, and then compute T^"-* using (21). Yet another alternative 
is to start from the diagonal estimator. 
The Algorithm: 

Step 0. Initialize f = f^°'> and f2(") = {f^o))Tf('^)_ 

Step 1. For each parameter tic, c = 1, . . . ,p,l = c, . . . ,p, solve "S/ ti^ft.k{T) = 
to find new tic- 

Step 2. Repeat Step 1 until convergence of T and set yC'+i) = T. 

Step 3. Set = (T'^k+i)-^Tj.{k+i) ^^^^ repeat Steps 1-3 until convergence 

otn. 

Steps 2 and 3 may seem redundant, but they are needed for two different 
reasons. Step 2 is needed because we only minimize with respect to one param- 
eter at a time, holding all other parameters fixed; and Step 3 is needed because 
of the quadratic approximation for After convergence, we replace entries 
in Ct with smaller magnitude than e with zero, using a fixed value of e = 10~®. 
Another approach with virtually the same performance is to replace entries of 
Cl^'^'f with e if their magnitude falls below e in Step 3, and use (25) directly in 
the objective function in Step 1 instead of using fe.k- 

In practice, we found that working with the correlation matrix as described in 
Theorem 2 is slightly better than working with the covariance matrix, although 
the differences are fairly small. Still, in all the numerical results we standard- 
ize the variables first and then rescale our estimate by the sample standard 
deviations of the variables. 

3.1. Algorithm convergence 

The convergence of the algorithm essentially follows from two standard re- 
sults. For the inner loop cycling through individual parameters, the value of 
the objective function decreases at each iteration, and the objective function 
is differentiable everywhere. Thus the inner loop of the algorithm converges by 
a standard theorem on cyclical coordinate descent for smooth functions (see, 
e.g., Bazaraa et al. (2006), p. 367), to a stationary point Vg{T) = 0, where 
g{T) = ff^k{T^T). The function f^^k is convex in the original parameters ujij, 
but since we reparametrized it in terms of T, the function g is not necessarily 
convex in T. In the next proposition we verify that this stationary point of g 
corresponds to the global minimum of the convex function /e.^. 

Proposition 1. Let f = /e.^ be the original convex function f approximated 
by the e-perturbed local quadratic approximation at iteration k, let T be a p x p 
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lower triangular matrix, and let g{T) = f{T^T). Let So be the unique solution 
to V/(iS') = 0, and let Tq be a solution to Wg{T) = 0. Then So = Tq To- 
Proof of Proposition 1. Let h : T T'^T. Note that h maps all of Rp(p+i)/2 
(all lower triangular matrices) into a convex subset of Rp(p+i)/2 (non-negative 
definite symmetric matrices). Denote the differential of h in the direction d € 
^p{p+i)/2 evaluated at to £ Kp(p+i)/2 by Vh{to)[d]. Then, 

Vh{to)[d]=T;^D + D^To, (26) 

where Tq and D are, respectively, to and d written as px p matrices. Now, using 
the chain rule and (26), we have 

Vg{to)[d] = Vf[vec[T^To)) {tJD + D^To) . (27) 

where we now think of / as a function from Rp(p+i)/2 to M. Since / is convex and 
has a unique minimizcr sq — vec(So), V/(s)[d] vanishes iff s = sq or d = 0. Thus 
Vg{to)[d] = vanishes iff Tg^To = S'o or T^D+D^Tq = 0, or T^ D = ~{T^ D)^ . 
If any diagonal elements of To are 0, then To is singular, and so is TqTq, and 
thus g{To) = CX3, so a singular Tq cannot be a stationary point of g. Since Tq is 
lower triangular and all its diagonal elements must be non-zero, one can show 
by induction that T^D = -{T^D^ implies D = 0. □ 

For the outer loop iterating through the quadratic approximation, we can 
apply the argument of Hunter and Li (2005) for e-perturbed local quadratic 
approximation obtained from general results for minorize-maximize algorithms, 
and conclude that as fc — > oo and e — *■ the algorithm converges to the global 
minimum of the original convex function / in (20). In practice, we have also 
observed that our algorithm and glasso converge to the same solution. 

3.2. Computational complexity 

The computational complexity of the algorithm in terms of p is 0{p'^), since 
each parameter update is at most 0{p) (see (32) in the Appendix), and there are 
0{p^) parameters. The only other algorithm for computing this estimator at the 
cost of O(p^) is glasso of Friedman et al. (2008); the algorithms of Yuan and Lin 

(2007) and d'Asprcmont et al. (2008) have higher computational cost. For ex- 
tensive timing comparisons of glasso and the algorithm of d'Asprcmont et al. 

(2008) , which showed convincingly that glasso is much faster, see Friedman et al. 
(2008). The exact timing also depends on the implementation, platform, etc (our 
algorithm is implemented in C and glasso in Fortran). Actual computing times 
we obtained for glasso and the SPICE algorithm are shown below in Figure 1, 
for model ^2 described in Section 4.1, with values of tuning parameters chosen 
as described in Section 3.3. 
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50 100 200 500 1000 

P 



Fig 1. Computing time in seconds vs p (log-log scale) for SPICE and glasso. 

3.3. Choice of tuning parameter 

Like any other penalty-based approach, SPICE requires selecting the tuning pa- 
rameter A. In simulations, we generate a separate validation dataset, and select 
A by maximizing the normal likelihood on the validation data with tl\ estimated 
from the training data. Alternatively, one can use 5-fold cross-validation, which 
we do for the real data analysis. There is some theoretical basis for selecting the 
tuning parameter in this way - see Bickel and Levina (2007). 

4. Numerical Results 

In this section, we compare the performance of SPICE to the shrinkage estimator 
of Ledoit and Wolf (2003) and to the sample covariance matrix when applicable 
{jp < n), using simulated and real data. We do not include any estimators that 
depend on variable ordering (such as banding of Bickel and Levina (2008) or the 
Lasso penalty on the Cholesky factor of Huang ct al. (2006)), nor estimators that 
focus on introducing sparsity in the covariance matrix itself rather than in its 
inverse (such as thresholding) , as they would automatically be at a disadvantage 
on sparse concentration matrices. The Ledoit-Wolf estimator does not introduce 
sparsity in the inverse either, but we use it as a benchmark for cases when p > n, 
since the sample covariance is not invertible. 

4.1. Simulations 

In simulations, we focus on comparing performance on sparse concentration ma- 
trices, with varying levels of sparsity. We consider the following four covariance 
models. 

1. ni: AR(1), a^j = CTI^'-^I. 

2. n^: AR(4), coy, = I(|j' - j\ = 0) + 0.4 • I(|j' - j\ = 1) 

+ 0.2 • 1(1/ - j\ = 2) + 0.2 • 1(1/ - j\ = 3) + 0.1 • 1(1/ - / = 4). 
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3. — B+SI, where each off-diagonal entry in B is generated independently 
and equals 0.5 with probability a = 0.1 or with probability 1 — a = 0.9. 
B has zeros on the diagonal, and S is chosen so that the condition number 
of Ha is p (keeping the diagonal constant across p would result in either 
loss of positive definiteness or convergence to identity for larger p) . 

4. fli. Same as fia except a = 0.5. 

All models are sparse (see Figure 2), and are numbered in order of decreasing 
sparsity (or increasing s). Note that the number of non-zero entries in 17i and fl2 
is proportional to p, whereas fl^ and have the expected number of non-zero 
entries proportional to p^. 

For all models, we generated n = 100 multivariate normal training obser- 
vations and a separate set of 100 validation observations. We considered five 
difi^erent values of 30, 100, 200, 500 and 1000. The estimators were computed 
on the training data, with the tuning parameter for SPICE selected by mini- 
mizing the normal likelihood on the validation data. Using these values of the 
tuning parameters, we computed the estimated concentration matrix on the 
training data and compared it to the population concentration matrix. 

We evaluate the concentration matrix estimation performance using the 
Kullback-Leibler loss, 



AklI^M) = tr - log j:n 



(28) 



Note that this loss is based on Cl and docs not require inversion to compute 
E, which is appropriate for a method estimating fi. The Kullback-Leibler loss 
was used by Yuan and Lin (2007) and Le\'iua et al. (2008) to assess perfor- 
mance of methods estimating fi, and is obtained from the standard entropy loss 
of the covariance matrix (Lin and Perlman, 1985; Wu and Pourahmadi, 2003; 
Huang et al., 200G) by reversing the roles of E and fl. 

Results for the four covariance models are summarized in Table 1, which 
reports the average loss and the standard error over 50 replications. For fii, O2, 
and fis, SPICE outperforms the Ledoit-Wolf estimator for all values of p. The 
sample covariance performs much worse than either estimator in all cases (for 

Table 1 

Simulations: Average (SE) Kullback-Leibler loss over 50 replications 



p 


Sample 


Ledoit-Wolf 


SPICE 


Sample 


Ledoit-Wolf 


SPICE 




Qi 




30 


8.52(0.14) 


3.49(0.04) 


1.61(0.03) 


8.52(0.14) 


2.77(0.02) 


2.55(0.03) 


100 


NA 


26.65(0.08) 


8.83(0.05) 


NA 


12.96(0.02) 


11.93(0.07) 


200 


NA 


76.83(0.13) 


21.23(0.09) 


NA 


28.16(0.01) 


24.82(0.07) 


500 


NA 


262.8(0.19) 


78.26(0.26) 


NA 


74.37(0.02) 


63.94(0.12) 


1000 


NA 


594.0(0.13) 


174.8(0.20) 


NA 


151.9(0.04) 


133.7(0.20) 




^3 


n4 


30 


8.45(0.12) 


3.50(0.05) 


2.12(0.04) 


8.45(0.12) 


3.04(0.04) 


3.77(0.04) 


100 


NA 


29.25(0.44) 


17.09(0.10) 


NA 


19.35(0.15) 


21.33(0.06) 


200 


NA 


86.93(1.64) 


45.58(0.13) 


NA 


53.18(0.37) 


51.93(0.13) 


500 


NA 


240.3(3.24) 


168.7(0.37) 


NA 


150.4(0.45) 


176.6(0.33) 


1000 


NA 


321.5(27.7) 


277.3(23.5) 


NA 


269.8(18.1) 


307.3(20.6) 
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(g) True ^4 




(h) SPICE Cii 



Fig 2. Heatmaps of zeros identified in the concentration matrix out of 50 replications. White 
color is 50/50 zeros identified, black is 0/50. 
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Table 2 

Percentage of correctly estimated non- zeros (TP %) and correctly estimated zeros (TN %) 
in the concentration matrix ( average and SE over 50 replications ) for SPICE 



p 


TP % TN % 


TP % TN % 






^2 


30 
100 
200 
500 
1000 


100(0.00) 68.74(0.31) 
100(0.00) 74.70(0.08) 
100(0.00) 73.57(0.04) 
100(0.00) 91.97(0.01) 
100(0.00) 98.95(0.00) 


50.18(1.44) 75.64(1.28) 
49.96(1.10) 72.68(1.21) 
27.62(0.12) 96.47(0.02) 
22.48(0.09) 98.81(0.00) 
22.29(0.05) 98.82(0.00) 








30 
100 
200 
500 
1000 


98.38(0.30) 63.85(1.28) 
93.90(0.27) 54.01(0.61) 
70.81(0.13) 69.82(0.05) 
28.93(0.06) 89.28(0.02) 
4.73(0.40) 72.36(6.13) 


74.15(0.61) 44.50(0.84) 
41.27(0.37) 63.07(0.36) 
35.77(0.06) 66.08(0.06) 
5.92(0.62) 94.27(0.61) 
2.07(0.14) 79.97(5.35) 



p = 30). For f74. the least sparse of the four models, the Lcdoit-Wolf estimator is 
about the same as SPICE (sometimes a httle better, sometimes a httle worse). 
This suggests, as we would expect from our bound on the rate of convergence, 
that SPICE provides the biggest gains in sparse models. 

To assess the performance of SPICE on recovering the sparsity structure 
in the inverse, we report percentages of true non-zeros estimated as non-zero 
(TP %) and percentages of true zeros estimated as zero (TN %) in Table 2. We 
also plot heatmaps of the percentage of time each element was estimated as zero 
out of the 50 replications in Figure 2, for p = 30 for all four models. In general, 
recovering the sparsity structure is easier for smaller p and for sparser models. 

Finally, some example computing times: the SPICE algorithm for 0,2 takes 
about 2 seconds for p ~ 200, 1 minute for p = 500, and 15 minutes for p = 
1000 on a regular PC. Glasso and SPICE both have complexity 0{p^), but 
because of the quadratic approximation, SPICE tends to require more iterations 
to converge, and on average, we have observed a difference in computing times 
on the order of about 10 between glasso and SPICE. However, this factor does 
not grow with p, and SPICE computing times are still very reasonable even for 
large p. 

4-2. Colon tumor classification example 

In this section, we compare performance of covariance estimators for LDA clas- 
sification of tumors using gene expression data from Alon et al. (1999). In this 
experiment, colon adenocarcinoma tissue samples were collected, 40 of which 
were tumor tissues and 22 non-tumor tissues. Tissue samples were analyzed us- 
ing an Affymetrix oligonucleotide array. The data were processed, filtered, and 
reduced to a subset of 2,000 gene expression values with the largest minimal 
intensity over the 62 tissue samples. Additional information about the dataset 
and pre-processing can be found in Alon et al. (1999). 

To assess the performance at different dimensions, we reduce the full dataset 
of 2,000 gene expression values by selecting p most significant genes as mea- 



A.J. Rothman et al. /Sparse covariance estimation 



511 



Table 3 

Averages and SEs of classification errors in % over 100 splits. Tuning parameter for 
SPICE chosen by (A): 5-fold CV on the training data maximizing the likelihood; (B): 5-fold 
CV on the training data minimizing the classification error; (C): minimizing the 
classification error on the test data 





p = 50 


p = 100 


p = 200 


N. Bayes 




15.8(0.77) 


20.0(0.84) 


23.1(0.96) 


L-W 




15.2(0.55) 


16.3(0.71) 


17.7(0.61) 


SPICE 


A 


12.1(0.65) 


18.7(0.84) 


18.3(0.66) 


SPICE 


B 


14.7(0.73) 


16.9(0.85) 


18.0(0.70) 


SPICE 


C 


9.0(0.57) 


9.1(0.51) 


10.2(0.52) 



sured by the two-sample i-statistic, for p — 50, 100, 200. Then we use Unear 
discriminant analysis (LDA) to classify these tissues as cither tumorous or non- 
tumorous. We classify each test observation x to either class fc = or fc = 1 
using the LDA rule 

Sk{x) = argmax jaJ^f^Afc " ^Al^^Afc + log^fcj , (29) 

where tti; is the proportion of class k observations in the training data, /ij, is 
the sample mean for class k on the training data, and f2 is an estimator of the 
inverse of the common covariance matrix on the training data computed by one 
of the methods under consideration. Detailed information on LDA can be found 
in Mardia ct al. (1979). 

To create training and test sets, we randomly split the data into a train- 
ing set of size 42 and a testing set of size 20; following the approach used by 
Wang et al. (2007), we require the training set to have 27 tumor samples and 15 
non-tumor samples. We repeat the split at random 100 times and measure the 
average classification error. The average errors with standard errors over the 100 
splits are presented in Table 3. We omit the sample covariance because it is not 
invertible with such a small sample size, and include the naive Bayes classifier 
instead (where S is estimated by a diagonal matrix with sample variances on 
the diagonal). Naive Bayes has been shown to perform better than the sample 
covariance in high-dimensional settings (Bickcl and Lcvina, 2004). 

For an application such as classification, there are several possibilities for 
selecting the tuning parameter. Since we have no separate validation data avail- 
able, we perform 5-fold cross-validation on the training data. One possibility 
(columns A in Table 3) is to continue using normal likelihood as a criterion for 
cross-validation, like we did in simulations. Another possibility (columns B in 
Table 3) is to use classification error as the cross-validation criterion, since that 
is the ultimate performance measure in this case. Table 3 shows that for SPICE 
both methods of tuning perform similarly. For reference, we also include the best 
error rate achievable on the test data, which is obtained by selecting the tuning 
parameter to minimize the classification error on the test data (columns C in 
Table 3). SPICE provides the best improvement over naive Bayes and Ledoit- 
Wolf for p = 50; for larger p, as less informative genes are added into the pool, 
the performance of all methods worsens. 
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5. Discussion 

We have analyzed a penalized likelihood approach to estimating a sparse con- 
centration matrix via a lasso-type penalty, and showed that its rate of conver- 
gence depends explicitly on how sparse the true matrix is. This is analogous 
to results for banding (Bickel and Lcvina, 2008) , where the rate of convergence 
depends on how quickly the off-diagonal elements of the true covariance de- 
cay, and for thresholding (Bickel and Lcvina, 2007; El Karoui, 2007), where the 
rate also depends on how sparse the true covariance is by various definitions of 
sparsity. We conjecture that other structures can be similarly dealt with, and 
other types of penalties may show similar behavior when applied to the "right" 
type of structure - for example, a ridge, bridge, or other more complex penalty 
may work well for a model that is not truly sparse but has many small entries. 
A generalization of this work to other penalties has been recently completed 
by Lam and Fan (2007), who have also proved "sparsistency" of SPICE- type 
estimators. 

While we assumed normality, it can be replaced by a tail condition, analo- 
gously to Bickel and Lcvina (2008). The use of normal likelihood is, of course, 
less justifiable if we do not assume normality, but it was found empirically that 
it still works reasonably well as a loss function even if the true distribution is 
not normal (L{>vina et al., 2008). 

The Cholesky decomposition of covariance was only considered appropriate 
when variables are ordered, and we have shown it to be a useful tool for enforc- 
ing positive definiteness of the estimator even when variables have no natural 
ordering. Our optimization algorithm has complexity of 0{p'^) and is equally 
applicable to general Iq penalties. 
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Appendix A: Derivation of the Algorithm 

In this section we give a full derivation of the parameter update equations 
involved in the optimization algorithm. Recall that we have re-parametrized the 
objective function (20) using (22)-(24). We cycle through the parameters in T 
and for each tic, compute partial derivatives with respect to tic while holding all 
other parameters fixed, and solve the univariate linear equation corresponding 
to setting this partial derivative to 0. 
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For simplicity, we separate the likelihood and the penalty by writing /(T) = 
£{T) + P{T). We also suppress the e-perturbation in the denominator for sim- 
plicity of notation. For the likelihood part, taking the partial derivative with 
respect to Uc, 1 < c < c < I < p gives 

= -2 1- 1 log,, 4 1 1; E (e ' 

j=l i=l j = l \k=l / 



=0 if j^c 



=0 if 



-2 \ - 

— 1{/ = c} + tic [2(7cc] +2 2^ tikakc, 

'^'^ k^c 



(30) 



For the penalty part, using the quadratic approximation (25) gives 



d 



E 



Xq 



d 

du/^^^ ^dUc^. 



E 



J >J J ■' k=l.k^c 



(31) 



since the only nonzero terms in (31) are those for which j' < I and either j' = c 
or j = c. For I < k < I such that fc ^ c, we have = 2ujcktik, and 

collecting terms together we get 



''Ik 



k=l,k=tc '^cfcl 



(Wcfe — tlctlk)tlk 



(32) 



Combining together (30) and (32), we have the parameter update equation 
for tic when / 7^ c, is given by 

~ _ -J2i=l. k^c'^lk^kc - MY!k=l,k=tci^ck ~ tlckk)tlk\ujlkV~'^ 
^cc + MT!k=l,k^ctlW~' 

If Z = c, we solve au^ + — 1 = for u using the quadratic formula, where 



a^acc + Xq *?fckcfer"', 

k=l,k^c 
I I 

Y tlkCTkc + M Y i^ck - tlctlk)tlk\u!ck\''~'^ ^ 
k—1. k^c k—l,k^c 

then take the positive solution tec ~ ■ 

We also can quickly update the ujck involving tic via 

Wcfc = i^^ck + tlkitlc — tic)- 
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